STA4173: Biostatistics
January 1, 2025
In this lecture, we will review summary statistics
Continuous variables
Categorical variables
In this course, we will review formulas, but we will use R for computational purposes
Remember to refer to the lecture notes for specific code needed
Code is also available on this course’s GitHub repository
We can use base R for some things, but I try to stay in the tidyverse when possible.
If we need to install packages, we use the install.packages() function.
To load an installed package, we use the library() function.
Definition: mean
\bar{x} = \frac{\sum_{i=1}^n x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}
Definition: variance
s^2 = \frac{\sum_{i=1}^n x_i^2 - \frac{(\sum_{i=1}^n x_i)^2}{n}}{n-1}
Definition: standard deviation
s = \sqrt{s^2}
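As a quick check of the formulas above, the sketch below computes the mean, variance, and standard deviation by hand and compares each with R's built-in mean(), var(), and sd(). The vector x is made-up illustration data, not data from this course.

```r
# Hypothetical data, for illustration only
x <- c(4, 8, 15, 16, 23, 42)
n <- length(x)

# Mean: sum of the observations divided by n
xbar <- sum(x) / n

# Variance: computational formula (sum of squares minus squared sum over n)
s2 <- (sum(x^2) - sum(x)^2 / n) / (n - 1)

# Standard deviation: square root of the variance
s <- sqrt(s2)

# Compare the by-hand values with the built-in functions
c(by_hand = xbar, built_in = mean(x))
c(by_hand = s2,   built_in = var(x))
c(by_hand = s,    built_in = sd(x))
```

In practice we would call mean(), var(), and sd() directly; the by-hand version is only there to connect the code to the formulas.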
Definition: median
The value that lies in the middle of the data when arranged in ascending order.
If n is odd, then the median is literally the middle number.
If n is even, then the median is the average of the two middle numbers.
R syntax:
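A minimal sketch of the odd and even cases using the base R median() function. Both vectors are made-up illustration data.

```r
# Hypothetical data, for illustration only
odd_n  <- c(7, 3, 9, 1, 5)       # n = 5, sorted: 1 3 5 7 9
even_n <- c(7, 3, 9, 1, 5, 11)   # n = 6, sorted: 1 3 5 7 9 11

# median() sorts the data internally, so the input order does not matter
med_odd  <- median(odd_n)    # the middle value
med_even <- median(even_n)   # the average of the two middle values

med_odd
med_even
```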
Definition: kth percentile, Pk
Definition: quartiles
Definition: five number summary
0% 25% 50% 75% 100%
62.00 71.75 79.50 87.00 94.00
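Output in this shape is what the base R quantile() function returns by default. The sketch below uses the built-in mtcars data rather than the data behind the numbers above (which is not shown here), so the values will differ.

```r
# Five number summary of gas mileage (mpg) in the built-in mtcars data:
# minimum, first quartile, median, third quartile, maximum
five_num <- quantile(mtcars$mpg)
five_num

# A single percentile, e.g., the 90th percentile
p90 <- quantile(mtcars$mpg, probs = 0.90)
p90
```

Note that quantile() interpolates between observations by default, so other software (or a hand calculation) may give slightly different quartiles.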
Definition: interquartile range
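Assuming the usual definition, IQR = Q3 - Q1 (the third quartile minus the first quartile), a short sketch on the built-in mtcars data compares a by-hand calculation with the base R IQR() function.

```r
# First and third quartiles of gas mileage in the built-in mtcars data
q <- quantile(mtcars$mpg, probs = c(0.25, 0.75))

# Interquartile range by hand: Q3 - Q1 (unname() drops the "75%" label)
iqr_by_hand <- unname(q[2] - q[1])

# Built-in function
iqr_built_in <- IQR(mtcars$mpg)

c(by_hand = iqr_by_hand, built_in = iqr_built_in)
```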
When we are dealing with categorical data, we summarize using frequency tables.
e.g., from the UWF Fact Book, in Fall 2021, there were
R syntax:
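One way to build a frequency table in the tidyverse is with count(). This is a minimal sketch using the built-in mtcars data; the variable chosen (cyl) is only an example.

```r
library(tidyverse)

# Frequency table for the number of cylinders in the built-in mtcars data
cyl_counts <- mtcars %>%
  count(cyl) %>%               # n = number of cars in each category
  mutate(freq = n / sum(n))    # relative frequency (proportion of all cars)

cyl_counts
```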
Consider the Motor Trend Car Road Tests data (mtcars), built into R.
The data was extracted from the 1974 Motor Trend magazine, and includes aspects of car design and performance for 32 cars (1973-74 models).
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
# A tibble: 4 × 4
# Groups:   am [2]
     am  gear     n  freq
  <dbl> <dbl> <int> <dbl>
1     0     3    15 0.789
2     0     4     4 0.211
3     1     4     8 0.615
4     1     5     5 0.385
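A table of this shape can be produced by counting within groups. The sketch below is one way to get it; it is a reconstruction and not necessarily the code used in lecture. In mtcars, am is the transmission type (0 = automatic, 1 = manual) and gear is the number of forward gears.

```r
library(tidyverse)

gear_by_am <- mtcars %>%
  group_by(am, gear) %>%
  summarize(n = n()) %>%        # count cars per (am, gear); result stays grouped by am
  mutate(freq = n / sum(n))     # proportion within each transmission type

gear_by_am
```

Because the result is still grouped by am after summarize(), sum(n) is taken within each transmission type, so the proportions add to 1 separately for automatic and manual cars.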
When presenting results to others, sometimes it is helpful to create a visualization.
Continuous data:
Categorical data:
Related to analyses:
We can also use color to incorporate other variables
We will use the ggplot2 package for most of our graphing needs.
It is included in the tidyverse package.
A good reference book is the official ggplot2: Elegant Graphics for Data Analysis text.
I will often google keywords + ggplot2 and look for examples that provide code.
e.g., “histogram ggplot2” led me to this website
e.g., “change color of dot ggplot2” led me to this website
Sometimes I have to look at several links before I find what I am looking for.
We use the ggplot() function to specify our underlying canvas.
We can use the tidyverse pipe operator (%>%) to pipe data into the ggplot() function.
We specify aesthetics using aes() inside of ggplot().
We will add elements to our graph using geom_ functions.
geom_line() creates a line
geom_point() creates a scatterplot
geom_bar() creates a bar chart
geom_text() puts text on the graph
There are more geom_ functions on the tidyverse website
The order that you add them matters!
geom_line() + geom_point() = points on top of line
geom_point() + geom_line() = line on top of points
We can also customize every aspect of our graphs.
e.g., the default background is gray, but I personally do not like it, so I typically use theme_minimal() or theme_bw() to give a white background
e.g., we can increase the font size to make things readable
e.g., we can specify colors for: markers (dots/points), outline of a bar chart or histogram, filling of a bar chart or histogram, lines, text, etc.
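The pieces described above (canvas, aesthetics, geoms added in order, a white-background theme, larger fonts, and custom colors) can be sketched in one short example. The variables (wt and mpg from the built-in mtcars data), the colors, and the font size are arbitrary choices for illustration.

```r
library(tidyverse)

p <- mtcars %>%
  ggplot(aes(x = wt, y = mpg)) +                # canvas and aesthetics
  geom_line(color = "gray60") +                 # added first, so it is drawn underneath
  geom_point(color = "#003865", size = 3) +     # added second, so points sit on top of the line
  labs(x = "Weight (1000 lbs)",
       y = "Gas Mileage (mpg)") +
  theme_bw() +                                  # white background instead of the default gray
  theme(text = element_text(size = 16))         # larger font for readability

p
```

Swapping the geom_line() and geom_point() lines would draw the line on top of the points instead.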
There are additional functions within other (non-tidyverse) packages that will help us with customization.
We can put graphs together using the ggarrange() function in the ggpubr package
We can use geom_emoji() from the emoGG package to display emojis in graphs :)
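As a minimal sketch of the ggarrange() function mentioned above, the example below places a histogram and a bar chart side by side. It assumes the ggpubr package is already installed; the two graphs and the layout (one row, two columns) are arbitrary choices for illustration.

```r
library(tidyverse)
library(ggpubr)   # assumed to be installed, e.g., via install.packages("ggpubr")

# Histogram of a continuous variable
p1 <- mtcars %>%
  ggplot(aes(x = mpg)) +
  geom_histogram(bins = 10) +
  labs(x = "Gas Mileage (mpg)") +
  theme_bw()

# Bar chart of a categorical variable
p2 <- mtcars %>%
  ggplot(aes(x = as.factor(cyl))) +
  geom_bar() +
  labs(x = "Number of Cylinders") +
  theme_bw()

# Put the two graphs side by side
combined <- ggarrange(p1, p2, ncol = 2, nrow = 1)
combined
```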
I do not expect you to become an expert in data visualization
As with other R code, I will provide basic code during lecture
I do encourage curiosity and exploring further
R is a very, very powerful tool for graphing!
Even before I was An Official R Programmer©, I used ggplot2 to construct graphs.
Other programs are just not great. :(
Today we will look at graphs that go along with summary statistics, but we will learn other ways to graph data as we progress through the semester.
library(tidyverse)  # provides %>%, group_by(), summarize(), and ggplot()

# Mean gas mileage for each combination of cylinders and transmission type
means <- mtcars %>%
  group_by(cyl, am) %>%
  summarize(mean = mean(mpg)) %>%
  ungroup()

# Plot the group means; am is coded 0 = automatic, 1 = manual
means %>%
  ggplot(aes(y = mean, x = cyl, color = as.factor(am))) +
  geom_point(size = 5) +
  labs(x = "Number of Cylinders",
       y = "Gas Mileage",
       color = "Transmission") +
  scale_color_manual(labels = c("Automatic", "Manual"),
                     values = c("#003865", "#8DC8E8")) +
  theme_bw()

# Same summary, now with the standard deviation for each group
means <- mtcars %>%
  group_by(cyl, am) %>%
  summarize(mean = mean(mpg),
            sd = sd(mpg)) %>%
  ungroup()

# Add error bars at mean ± 1 standard deviation
means %>%
  ggplot(aes(y = mean, x = cyl, color = as.factor(am))) +
  geom_point(size = 5) +
  geom_errorbar(aes(ymin = mean - sd, ymax = mean + sd), width = 0.15) +
  labs(x = "Number of Cylinders",
       y = "Gas Mileage",
       color = "Transmission") +
  scale_color_manual(labels = c("Automatic", "Manual"),
                     values = c("#003865", "#8DC8E8")) +
  theme_bw()

In this lecture, we have reviewed how to describe data.
There is not a one-size-fits-all graph!
Next, we will review statistical inference.